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With the recent advent of high-throughput genotyping techniques, 
genetic data for genome-wide association studies (GWAS) have be¬ 
come increasingly available, which entails the development of efficient 
and effective statistical approaches. Although many such approaches 
have been developed and used to identify single-nucleotide polymor¬ 
phisms (SNPs) that are associated with complex traits or diseases, 
few are able to detect gene-gene interactions among different SNPs. 
Genetic interactions, also known as epistasis, have been recognized 
to play a pivotal role in contributing to the genetic variation of phe¬ 
notypic traits. However, because of an extremely large number of 
SNP-SNP combinations in GWAS, the model dimensionality can 
quickly become so overwhelming that no prevailing variable selec¬ 
tion methods are capable of handling this problem. In this paper, we 
present a statistical framework for characterizing main genetic effects 
and epistatic interactions in a GWAS study. Specifically, we first pro¬ 
pose a two-stage sure independence screening (TS-SIS) procedure and 
generate a pool of candidate SNPs and interactions, which serve as 
predictors to explain and predict the phenotypes of a complex trait. 

We also propose a rates adjusted thresholding estimation (RATE) 
approach to determine the size of the reduced model selected by an 
independence screening. Regularization regression methods, such as 
LASSO or SCAD, are then applied to further identify important ge¬ 
netic effects. Simulation studies show that the TS-SIS procedure is 
computationally efficient and has an outstanding finite sample per¬ 
formance in selecting potential SNPs as well as gene-gene interac¬ 
tions. We apply the proposed framework to analyze an ultrahigh¬ 
dimensional GWAS data set from the Framingham Heart Study, and 
select 23 active SNPs and 24 active epistatic interactions for the body 
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mass index variation. It shows the capability of our procedure to re¬ 
solve the complexity of genetic control. 


1. Introduction. Genome-wide association studies (GWAS) have been a 
powerful tool for genetic and biomedical research. The past decade has wit¬ 
nessed the rapid development of GWAS and the substantial contributions 
it has made [Altshuler, Daly and Lander (2008); Psychiatric GGGG (2009); 
Hirschhorn (2009); Das et al. (2011)]. With advances in high-throughput 
genotyping techniques and modern statistics, GWAS have been helping in¬ 
vestigators understand the genetic basis of many complex traits or diseases, 
providing valuable clues to the genetic predisposition of common diseases 
and drug responses [Burton et al. (2007); Daly (2010)], among others. 

In a typical GWAS, hundreds of thousands of single-nucleotide polymor¬ 
phisms (SNPs) are usually genotyped on a cohort being studied to identify 
important genetic variants that are associated with the trait of interest. Al¬ 
though fast and inexpensive, the collection of genetic information is normally 
limited to a sample involving hundreds of subjects, which brings statistical 
challenges for estimating and identifying relevant genetic risk factors. With 
SNPs being predictors and phenotypes being the response, single-SNP anal¬ 
ysis is mostly performed. However, such a single-SNP approach is neither 
efficient nor precise, since it fails to consider all SNPs and their possible 
interactions simultaneously, and to adjust the estimated effects accordingly. 
Therefore, many statistical procedures that consider all SNPs jointly have 
been proposed for analyzing the high-dimensional data sets generated by 
genome-wide association studies. 

This is a feature selection problem for high-dimensional data, where the 
number of SNPs (p) is much larger than the number of observations (n). 
Penalized regressions, which are developed to overcome severe drawbacks 
of traditional variable selection techniques, are widely used to select a sub¬ 
set of important predictors from a large number of potential predictors. In 
the GWAS analysis of case-control studies, Wu et al. (2009) and Gho et al. 
(2009) applied LASSO penalized regression [Tibshirani (1996)] and elastic- 
net penalized regression [Zou and Hastie (2005)], respectively. Ayers and 
Cordell (2010) further conducted a comprehensive study to examine the per¬ 
formance of a variety of penalized regressions in case-control studies. They 
concluded that variable selection techniques based on penalized regressions 
outperform single-SNP analysis and stepwise selection. To further explore 
the potential of high-dimensional statistical models for identifying disease 
susceptibility genes, several two-stage approaches have been proposed for se¬ 
lecting significant main effects. Li et al. (2011) employed preconditioning and 
Bayesian LASSO on population cohorts to estimate genetic effects of SNPs 
on continuous traits. He and Lin (2011) developed a GWASelect procedure 
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for case-control cohorts, where several steps of iterative sure independence 
screening (ISIS) and LASSO regression are involved. 

These methods based on penalized regressions have demonstrated their 
statistical power and computational feasibility over the single-SNP analy¬ 
sis. However, since statistical methodologies and computations have already 
been challenged by the overwhelming number of whole-genome SNPs, these 
methods either do not consider gene-gene interactions or estimate inter¬ 
actions among only a small number of selected SNPs with significant main 
effects. However, without considering the full picture of epistatic interactions 
in GWAS analysis, only a limited portion of phenotypic variation can be ex¬ 
plained such that potential disease-associated pathways and risk factors can 
hardly be identified [Manolio et al. (2009); Cordell (2009)]. 

In light of recent developments in machine learning, many sophisticated 
approaches have been proposed to search for whole-genome interactions in 
genome-wide association studies, most of which are designed for case-control 
cohorts. These machine learning approaches include a Bayesian partitioning 
model [Zhang and Liu (2007)], a SNPRuler based on an association rule 
[Wan et al. (2010b)] and random forest approaches [Breiman (2001); Kim 
et al. (2009)]. However, these methods are computationally intensive and 
do not perform well in practice when the genome-wide SNP data are con¬ 
sidered [Wan et al. (2010b); Wang et al. (2011); Szymczak et al. (2009)]. 
More recently, adaptive LASSO [Yang et al. (2010)] and Bayesian general¬ 
ized linear models [Yi, Kaklamani and Pasche (2011)] are applied to detect 
epistatic interactions in case-control cohorts where all SNP pairs are exhaus¬ 
tively searched. Wang et al. (2011) present a comprehensive comparison of 
the prevailing epistatic interaction detection methods, including SNPRuler 
[Wan et al. (2010b)], SNPHarvester [Yang et al. (2009)], Screen and Clean 
[Wu et al. (2010)], BOOST [Wan et al. (2010a)] and TEAM [Zhang et al. 
(2010)]. They concluded that these methods perform differently in terms 
of statistical power, false positive rate and computational cost. However, 
methods other than Screen and Clean are specially designed for case-control 
studies where phenotypic values are binary, and cannot be applied to quan¬ 
titative traits unless the phenotypical values are properly discretized. 

In this paper, we propose a statistical framework for detecting whole- 
genome epistatic interactions in a population cohort where phenotype is 
continuous. The framework incorporates the well-developed penalized re¬ 
gressions, which have proved successful in detecting SNPs with significant 
main effects. Therefore, existing findings regarding penalized regression the¬ 
ories and their empirical performance in GWAS analysis can provide direct 
and valuable insights into our framework. Moreover, the proposed algorithm 
is suitable for parallel computing and does not involve computationally de¬ 
manding techniques on the whole-genome SNP data that prevailing interac¬ 
tion models may involve, such as resampling strategies and Bayesian analy¬ 
sis. As a result, it is computationally efficient. 
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Specifically, we develop a two-stage sure independence screening (TS-SIS) 
procedure before variable selection. The screening step forms a pool of im¬ 
portant SNPs, which may either have significant main effects or demonstrate 
no marginal effects but strong epistatic interactions. Since the two-stage 
screening is based on sure independence screening [Fan and Lv (2008)], the 
computational burden of selecting important interactions is greatly reduced. 
More importantly, this procedure guarantees the performance of the follow¬ 
ing variable selection procedure, in the sense that once important SNPs 
and interactions enter the pool, the probability of identifying the correct 
ones is very high. We also propose a rates adjusted thresholding estima¬ 
tion (RATE) approach to determine the number of predictors retained by 
a variable screening procedure. This approach is based on soft-thresholding 
and bootstrapping, and relates the reduced model size to a false positive 
rate. Ueki and Tamiya (2012) proposed hard-thresholding-based sure in¬ 
dependence screening (SIS) to select promising main genetic effects and 
interactions for penalized regressions. Motivated by the multifactor dimen¬ 
sionality reduction [MDR; Ritchie et al. (2001)] approach, they proposed 
dummy coding methods to effectively capture various patterns of interac¬ 
tions in case-control studies. Our approach, however, is more general and 
suitable for population-based GWAS and other variable screening problems. 

We applied the newly developed statistical framework to analyze a GWAS 
data set from the Framingham Heart Study, aimed to identify genetic vari¬ 
ants that are associated with obesity, blood pressure and heart disease. We 
find that, out of 349,985 SNPs, 23 SNPs and 24 epistatic interactions have 
notable effects on the body mass index (BMI). By applying gene-set enrich¬ 
ment analysis tools [Wang, Li and Bucan (2007); Holden et al. (2008)] in 
future studies, biological knowledge can be integrated to discover and pri¬ 
oritize signaling pathways implied by detected SNPs. Morever, SNP-SNP 
interactions will provide insight into functional related genes and the struc¬ 
ture of genetic pathways, allowing better understanding of complex genetic 
architecture and cellular processes in a system level. 

In Section 2 we introduce the TS-SIS procedure that reduces the model 
dimensionality and identifies potential gene-gene interactions. Section 3 pro¬ 
poses a rates adjusted thresholding estimation (RATE) approach to deter¬ 
mine the number of predictors retained by a general variable screening pro¬ 
cedure. Section 4 shows how penalized regression can fit in this framework 
and gives the estimation procedure for SGAD penalized regression [Fan and 
Li (2001)]. In Section 5 the statistical properties of this framework are in¬ 
vestigated through simulation studies. Section 6 applies this framework to 
the Framingham Heart Study. Concluding remarks are given in Section 7. 

2. Two-stage sure independence screening. In genome-wide association 
studies phenotypical measurements are explained by a handful of covariates 
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and a great number of genetic factors represented by SNP genotypes. To se¬ 
lect important SNPs and estimate their genetic effects precisely by adjusting 
for observed covariates, we employ a GWAS model that takes into account 
the effects of both genetic effects and covariate effects. Moreover, as we dis¬ 
cussed, epistatic interactions play a central role in understanding metabolic 
pathways of complex diseases and traits. Therefore, a comprehensive GWAS 
model incorporating both main genetic effects and gene-gene interactions is 
more appropriate. 

For subject i in a population cohort consisting of a total of n subjects, 
we describe the observed phenotypic value yi as 


Ui — IJ- + Xk,iOtk + Qj,i dj + 

k=l j=l j=l j=lji<j 

P P P P 


(2.1) + X] + X] X] 

j=lj'=l j=lj'=l 


■da 

jj' 


P 

+ X] X] + £i, 

i=i j'<j 


where y is the overall mean, q is the number of nongenetic covariates, p is 
the number of SNPs, Xk,i is the kth covariate for subject i,k = 1,... ,q,i = 
l,...,n, which could be either discrete or continuous, is the effect of 
the feth covariate, aj and dj are the additive effect and dominant effect 
of the jth SNP, respectively, for j = 1,... ,p, is the additive x additive 
epistatic effect between the jth SNP and the j'th SNP, 2^“ and are 
additive x dominant epistatic effect, dominant x additive epistatic effect and 
dominant x dominant epistatic effect, and £i is the residual error assumed to 
follow a N{Q,a‘^) distribution. If an effect is nonzero in the regression model 
(2.1), we say that the corresponding covariate or interaction is active. For 
subject i, and Qj are the indicators of the additive and dominant effects 
of the jth SNP, respectively, which are defined as 


^j,i ~ 


Cj,i — 



if the genotype of SNP j is AA, 
if the genotype of SNP j is Aa, 
if the genotype of SNP j is aa, 
if the genotype of SNP j is Aa, 
if the genotype of SNP j is AA or aa. 


Therefore, the additive effect aj in model (2.1) measures the change of the 
average phenotypic value by substituting allele A with allele a in a pop¬ 
ulation. Dominant effect dj, on the other hand, represents how the effect 
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of allele A is modified by the presence of allele a, allowing a more general 
nonadditive genetic model. 

Given observed phenotypic traits, genetic information and covariates such 
as gender or age, our goal is to characterize the genetic control of the phe¬ 
notype, by selecting active SNPs and gene-gene interactions and estimat¬ 
ing their genetic effects. However, since in GWAS data sets, the number 
of SNPs usually far exceeds the number of subjects, it is almost impos¬ 
sible to directly estimate all genetic effects, as even epistatic interactions 
are not considered in the regression model. Recently, penalized regressions 
that regularize the size of regression coefficients are applied to GWAS mod¬ 
els without interactions, and appropriate algorithms are designed for high¬ 
dimensional inference, such as cyclical coordinate descent methods. But in 
many clinical trials where the number of SNPs is extremely large compared 
with the sample size, the empirical performance of penalized regression is 
not guaranteed. Moreover, if four interaction terms for each SNP pair are 
considered in the GWAS analysis, the estimation of all genetic effects in 
the ultrahigh-dimensional setting is infeasible from the perspective of both 
statistical theories and computational cost. 

To identify this ultrahigh-dimensional model in practice, and to make 
the best use of GWAS data for better explanation and predictions, we need 
to put assumptions on the heredity structures of epistatic effects, although 
we want to make the restrictions as weak as possible. Two versions of the 
effect heredity principle are the following: strong heredity and weak heredity 
[Ghipman (1996)]. Under strong heredity, if the interaction between two 
predictors is significant, both predictors should be marginally signihcant. 
Under weak heredity, only one needs to be marginally signihcant. 

Obviously, in prevailing penalized regression models for GWAS, where 
interaction effects are tested after a subset of signihcant SNPs are selected, 
strong heredity assumption is implicitly imposed. However, throughout this 
paper we will assume only weak heredity, since, in practice, many important 
SNPs are marginally uncorrelated with the response, but interact with other 
SNPs in an epistasis network. With this biologically meaningful assumption 
in the epistatic GWAS model as well as large data sets collected in genome¬ 
wide studies, the potential of GWAS could be fully explored, and a detailed 
picture of genetic control and regulation could be unveiled. 

Two SNPs involved in a two-way interaction will be denoted as “two 
roots.” We will employ a two-stage sure independence screening (TS-SIS) 
procedure to identify SNPs which may have active main effects or may act 
as roots. Sure independence screening is a statistical learning technique for 
ultrahigh-dimensional data proposed by Fan and Lv (2008). In the context of 
GWAS analysis, it ranks the importance of SNPs according to their marginal 
correlations with the response and retains those SNPs whose marginal cor¬ 
relations are strong enough. It can be shown that under some technical 
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conditions, sure independence screening enjoys the sure screening property. 
That is, the reduced model is capable of retaining all the active SNPs with 
asymptotic probability one. 

Let T>a and T>d be two sets of indices of truly important additive effects 
and truly important dominant effects, respectively. The first SIS round will 
be performed between each SNP and the response to select active main 
effects. Since it is common practice to include covariates as linear predictors 
of the response in GWAS analysis, covariates are not subject to SIS and 
will later be added to the reduced model after TS-SIS. After the hrst stage 
of SIS, two subsets of SNPs with potential nonzero additive effects T>a and 
potential nonzero dominant effects are selected. Sure screening property 
[Fan and Lv (2008)] implies that truly important main effects are retained 
in Va and with high probabilities. 

Next, we formulate pairwise epistatic interactions between all SNPs in 
T>a or T>(i and all genome-wide SNPs. In particular, an additive x additive 
interaction term is formulated by taking one SNP from T>a and taking 
any additive effect from all SNPs. The set of additive x additive interac¬ 
tions are denoted by P® = {(j,/) G Paj/ = 1)2,... ,p}. Similarly, 

additive x dominant interactions , dominant x additive interactions P®, 
and dominant x dominant interactions are formulated, and the GWAS 
model becomes 


Vi — T ^ ^ Xk,i^k T ^ ^ A ^ ^ Cj,idj T ^ ^ 
j&Va jeVa 


aa 

'jj' 


k=l 


(2.2) + ^ E 


da 

'jj' 






+ E CjjCj'Ajf+^i- 




After adding interaction terms in the model (2.2), the model dimen¬ 
sionality becomes extremely high compared with GWAS model without 
epistatic interactions. To test whether these interactions contribute to the 
observed variation in phenotypes, we again apply SIS to all interaction terms 
and select epistatic effects that are highly correlated with the response. 
Let Paa be the index set for the selected additive x additive interactions 
between a SNP in P^ and another genome-wide SNP. Similarly, we de¬ 
fine three other sets, Vad, T^da and T>dd-, which contain selected additive x 
dominant, dominant x additive and dominant xdominant interactions, re- 
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spectively. Then the GWAS model after TS-SIS becomes 
<? 

2/i ~T ^ ^ T ^ ^ A ^ ^ Cj,idj T ^ ^ 

je3d ij,j')^'^aa 

(2.3) + (,.iCi'A!f+ E_ 

{j,j')^'5ad (iJOe^da 

+ Y1 0,*CiM^jy+ £*■ 

(j,i')^'S'dd 

Algorithm 1 summarizes the TS-SIS procedure, where the sizes of the 
reduced models in steps 1 and 3 will be determined by the RATE approach 
proposed in Section 3. 


Algorithm 1 Two-stage sure independence screening 

Step 1. Apply the SIS approach to all additive ai^ dominate main effects 
SNPs and estimate the reduced models T>a and T>d. 

Step 2. Formulate pairwise epistatic interactions between all SNPs selected 
in Va or Vd and all genome-wide SNPs T> = {1,2,... ,p}. That is, V^J = 
€ Pa,/ G P}, pS = € Pa,/ G P}, pjj = 

{(//): Cj^fJ G Vd,j' G P}, and P® = {(j,/) : (jCfJ G P^,/ G P}. 

Step 3. Apply the SIS approach again to all epistatic interactions in step 2, 

that is, V^aa , ^^^dd ’ obtain the reduced models Paa, Pad, 

Prfa and Pfifrf. 

Step 4. Combine all reduced models in steps 1 and 3 to obtain the final 
selected model by the TS-SIS procedure: {Pa,Pd,Paa,Pad,Pda,Pdd}- 


Remark. If some nongenetic covariates (such as age) are known as truly 
significant predictors in model (2.1), the following modified independence 
screening procedure can be implemented to improve the performance in 
steps 1 and 3 of Algorithm 1. We run a linear regression of the response on 
each SNP and the significant nongenetic covariates, and utilize the magni¬ 
tude of the SNR’s estimated coefficient as a marginal screening utility.^ 

3. Rates adjusted thresholding estimation. In this section we propose 
a general rule to determine the size of the reduced model selected by an 


®We thank the Associate Editor for suggesting this modified independence screening. 
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independence screening procedure. This rule can be applied to other inde¬ 
pendence screening methods. In its application to the proposed TS-SIS, it 
is equivalent to determining the cardinalities of sets Pq, Paa, Pad) Prfa 
and Vdd- 

In general, the choice of the reduced model size is critical for any inde¬ 
pendence screening approach. If the model size is too large, the following 
penalized regression would be less efficient due to the presence of too many 
noise variables. If the model size is too small, on the other hand, it is likely 
to miss important predictors in the screening stage. Fan and Lv (2008) sug¬ 
gested the reduced model size being proportional to [n/log re] for the SIS 
procedure, where re is the sample size and [•] denotes the integer of a real 
number. Although this hard thresholding is easy to implement in practice, 
little theoretical evidence is provided to guarantee its performance in differ¬ 
ent data sets. Zhu et al. (2011) proposed a soft-thresholding rule by adding 
auxiliary variables in their Sure Independent Ranking and Screening (SIRS) 
procedure for multi-index models with ultrahigh-dimensional covariates. In 
what follows we propose a general data-driven procedure to determine the 
reduced model size that extends the soft-thresholding procedure. 

Denote the ^^-dimensional vector of predictors by x = [Xi ,..., Xp^ ), and 
denote the vector of regression coefficients by 7 = ( 71 ,..., 7 ^^) in a lin¬ 
ear regression model. Let Ai be the set of active predictors and be 
its complement. That is, Ad = {1 < J < Pn: 7 j / 0} and Ai^ = {1 <3 < 
Pn'-lj — 0}- The idea of the soft-thresholding rule in Zhu et al. (2011) is 
as follows. First, d auxiliary variables are generated independently and ran¬ 
domly z = {Zi,..., Zd) r\j Nd{0,ld)- Next, an independence screening pro¬ 
cedure is applied to the combined predictors set (x"*",z"*")"*". Let pk be the 
marginal screening utility between each predictor and the response, where 
A: = 1,2,..., {pn + d). Because z is known to be independent of the response, 
the marginal utility pk between any Z^ and the response is exactly zero 
and the associated sample version pk should be less than any marginal 
utility between the active predictors and the response. Zhu et al. (2011) 
suggested the maximal sample marginal utility of all auxiliary variables, 
Cd = maxi<m<d/Op„+m) as a natural cutoff to separate two sets of active 
and inactive predictors in x. Thus, the selected model is determined by 

Ai = {I <j <pn-.pj> Cd}. 

Although the soft-thresholding procedure may be useful, there are two 
major concerns of practical interest. The first is the choice of the number of 
auxiliary variables d. The larger the d value, the sparser the selected model, 
and thus the higher the probability of missing some active predictors. Be¬ 
sides, a larger d value implies more computation cost. On the other hand, a 
smaller d value gives a smaller cutoff, thus the reduced model dimensionality 
could still be very high. The second concern is how to generate independent 
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auxiliary variables z. The performance of the soft-thresholding rule depends 
on an exchangeability assumption between inactive predictors and auxiliary 
variables assumed in Theorem 3 of Zhu et al. (2011). But its validity is dif¬ 
ficult to check in practice.'^ To address these concerns, we propose a rates 
adjusted thresholding estimation (RATE) approach to determine the num¬ 
ber of auxiliary variables d by bootstrapping auxiliary variables from the 
original data. 

In particular, we propose to relate the number of auxiliary variables d to 
the false positive rate of an independence screening procedure 

which is the proportion of inactive predictors that are incorrectly included in 
the selected model Ad. In data mining and bioinformatics, statistical power 
is also known as sensitivity, and false positive rate is one minus specificity. 
Both sensitivity and specificity are performance measures of interest in ge¬ 
netic association studies [see, e.g., Duggal et al. (2008); Gorlov et al. (2008); 
Harley et al. (2008); Jacobs et al. (2009)]. 

The next theorem provides a lower bound on the probability that the false 
positive rate is controlled under a pre-specified level a. 


Theorem 1. Suppose that the inaetive variables {Xj-.j G Ad^} and 
auxiliary variables {Zk : k = 1,... ,d} are exchangeable in the sense that the 
inactive and auxiliary variables are equally likely to be selected by the inde¬ 
pendence screening procedure. Under the sparsity condition that Sn < n, the 
probability that the false positive rate can be controlled under a pre-specified 
level a is bounded from below. That is, 


(3.1) 


P 


jjwnAd'^ 

lAd'^l 



> 1 - 


ajpn - n) ) 
Pn + d / ■ 


The theorem implies that the probability of the false positive rate being 
controlled below a given level a is greater than 1 — {1 — Given 

a fixed confidence level 1 — /3 = 1 — {1 — the number of auxil¬ 

iary variables d can be determined. According to Theorem 1, we propose 
the RATE procedure in Algorithm 2 for a general independence screening 
method. 


^For instance, both the Editor and Associate Editor mentioned that the distribution 
of “noisy” SNPs is quite different from a normal distribution, so the exchangeability 
assumption may be violated. We thank the Editor and Associate Editor for pointing 
this out. 
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Algorithm 2 Rates adjusted thresholding estimation 

Step 1. Solve the equation {1 — = /3 to obtain d, for given pn, 

n, a and /3. 

Step 2. Bootstrap the original data x = {Xi, ..., Xp^) to obtain d indepen¬ 
dent auxiliary variables z = (Zi, ..., Z^) in the following way. For each i = 
1,... ,n, randomly assign one value of [Xik, .. ■ ■ ■ -^Xnk) 

to Zik, and then get the vector Zk = {Zik, ■. ■, Znk) for k = 1, ... ,d. If 
d> pn, one may iterate the above procedure until getting enough auxil¬ 
iary variables. 

Step 3. Compute the marginal screening utility between each aux¬ 
iliary variable Z^ and the response, k = l,...,d, and set the cutoff 
Cd = may:i<k<dP*k- 

Step 4. Compute the marginal screening utility pj between each predictor 
Xj and the response, j = 1,... ,p„, and select the reduced model as M. = 
{1 <j<Pn-Pj > Cd}. 


We remark that the modihed bootstrapping procedure in step 2 in Algo¬ 
rithm 2 is to guarantee the independence between the response and auxiliary 
variables z. We obtain independent auxiliary variables by bootstrapping the 
original data instead of simulating them from a normal distribution. Conse¬ 
quently, the bootstrapped auxiliary variables have the same data structure 
as the original predictors, approximating the exchangeability condition in 
the soft-thresholding rule. Note that with given pn and n, two rates a and 
/3 together determine the number of auxiliary variables d. Therefore, we call 
this approach the rates adjusted thresholding estimation (RATE). It will 
be shown later that the RATE approach has excellent performance in the 
simulation studies and the real data analysis. 


4. SCAD penalized regression. After two-stage sure independence screen¬ 
ing, the dimensionality of the GWAS model is greatly reduced. In order to 
precisely select important SNPs and epistatic interactions from a pool of 
candidate effects, penalized regressions widely used in main-effect analysis 
could be incorporated here. Specihcally, we put penalties on the sizes of 
additive effects, dominant effects and all epistatic effects and minimize the 
following penalized least squares: 


(4.1) 



Eyf + '^ P\{\aj\) + Px{\dj\) + 
j^'^a j&Vd. 


+ !>a(|i#I)+ fAdi#!). 
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where the penalty function p\{-) is implemented to shrink sufficiently small 
effects to zero and thus exclude the inactive predictors. 

We consider the smoothly clipped absolute deviation (SCAD) penalty 
function due to its unbiasedness, continuity and sparsity properties [Fan 
and Li (2001)]. The SCAD penalty is a nonconvex function and defined as 
follows: 


px{b) = A|6|/(0 < |6| < A) + ^ —(^!±^!1Z^/(A < |6| < aA) 


a — I 


(a + 1)A^ rnul \ ^ 

+ - TT -^( 1^1 > 


where /(•) is an indicator function and a = 3.7 as suggested in Fan and Li 
(2001). A is the tuning parameter which balances the model complexity and 
forecasting performance. We follow the idea of Wang, Li and Tsai (2007) 
and choose A by a BIC tuning parameter selector. 

Commonly-used algorithms for the SCAD penalized least squares include 
the local quadratic approximation (LQA) algorithm [Fan and Li (2001)], the 
perturbed LQA [Hunter and Li (2005)] and the local linear approximation 
(LLA) [Zou and Li (2008)] algorithm. With the aid of LLA, one may employ 
the LARS algorithm to obtain the SCAD estimate. Thus, we will use the 
LLA algorithm in this paper. Specifically, for a given initial value the 
penalty function p\{-) can be locally approximated by a linear function as 


(4.2) Pa(|/3|)«Pa(|/?®|)+p'a(I/ 3^°^I)(I/3|-|/3^°^I) for |/3|«|/3(°)|. 


With the aid of LLA, the estimates of regression coefficients in SCAD pe¬ 
nalized least squares (4.1) can be obtained by minimizing 


1 

2n 


y - 






+ P'x(\df^\)\dj 

j&i’d 


(4.3) 

+ E 






+ E 




Ud')&'^dd 


after constants are discarded. Note that this penalized least squares can be 
easily minimized based on Li penalized regression. 


5. Simulated studies. In this section we investigate the GWAS analysis 
framework consisting of TS-SIS and variable selection through simulation 
studies. We simulate large data sets where SNPs may have either (a) main 
effects or (b) interaction effects. Our goal is to identify these active SNPs 
with high accuracy and low computational cost. 
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Specifically, genotypes of p SNPs across 23 chromosomes are generated 
for n = 500 subjects. For SNP j of subject i,j = 1,... ,p, z = 1,..., n, its 
genotype is derived from Uj^i, where the vector , Up^i) is generated 

from multivariate normal distribution with zero mean and covariance matrix 
^ = i(^j,k)pxp, (^j,k = for p = 0.2, 0.5 or 0.8. Then, we set 





Uij > C\j , 

C2j ^ V^ij ^ C\j , 
Uij <1 C2j 1 


where cij and C 2 j determine the minor allele frequency (MAFs). We consider 
two cases: homogeneous case, MAF = 0.5 for each j, and heterogeneous 
case, in which the MAF of each SNP is randomly set to 0.5, 0.35 or 0.2 
with equal likelihood. Finally, the dominant effect indicator is derived 
from by setting = 1 — In total, there are p = 3948 SNPs across 
23 chromosomes, with the number of SNPs in each chromosome being one 
percent of that in a real data set we are going to work on. 

We put 3 active main effects and 3 active epistatic interactions across 
the whole genome, whose positions and effect sizes are given in Table 1. 
Column “Interact with” in Table 1 indicates, out of 3 active main effects, 
which one the SNP interacts with. When simulating the response variable, 
we standardize the design matrix columnwisely, such that all columns of the 
design matrix have the same variance. This step makes the comparison of 
detecting active main effects and active interactions fair. From Table 1, it can 
be seen that one SNP could interact with two other SNPs without marginal 
effects (three SNPs on chromosomes 2 and 12), and two SNPs involved in a 
two-way interaction may also be correlated (two SNPs on chromosomes 2). 
These interaction patterns add further complexity in the simulation studies. 

For each simulated data set, we hrst implement SIS with the RATE pro¬ 
cedure to select Si SNPs, which may exhibit notable main effects or epistatic 


Table 1 

Information of 6 assumed genetic effects for data simulation 


Chr. 

Position 

Additive/dominant Interact with 

Effect size 

1 

1 

Main effects 

Additive - 

1 

2 

1 

Dominant - 

1 

3 

1 

Additive - 

1 

11 

1 

Epistatic interactions 

Additive 1 

1 

2 

2 

Dominant 2 

1 

12 

1 

Dominant 2 

1 
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effects. We determine si in each simulation according to Theorem 1 with 
a = 0.01 and [3 = 0.0001 and, on average, there are 11 SNPs selected in the 
first stage of TS-SIS. According to the sure screening property, this subset 
of Si SNPs should include the first SNPs on the first 3 chromosomes with 
high probability, which demonstrate active main effects and may serve as 
roots in two-way interactions. 

To select those SNPs that have no marginal effects, but modify the genetic 
effects of other SNPs, two-way interactions are formed between each selected 
SNP in the first stage and any SNP across the genome according to model 
(2). SIS is carried out again, and, in total, S 2 pairs of SNPs are selected. We 
set a = 0.005 and /3 = 0.0001. These SNP pairs should contain all epistatic 
interactions, although they may rank low in terms of the absolute value of 
marginal correlations. Finally, si SNPs with potential main effects and S 2 
SNP pairs enter model (2.3), and variable selections are implemented to 
select important SNPs and estimate their main effects and epistatic effects. 
We consider both LASSO regression and SCAD regression following TS-SIS. 

Table 2 reports the statistical power, false positive rates and computa¬ 
tional time using R code. The result is the average over 100 simulations 
with standard error in parenthesis. In columns labeled “TS-SIS” under 
“Power (%),” we present the statistical power of TS-SIS, or the propor¬ 
tion of active SNPs and interactions that are successfully included in the 
candidate pool of si SNPs and S 2 interactions. In adjacent columns “TS- 
SIS-SCAD” and “TS-SIS-LASSO,” we report statistical powers of two-stage 
SIS paired with SCAD regression or LASSO regression, or the percent of 6 
active SNPs that are correctly identified by the whole procedure. Note that 
the statistical power under “TS-SIS-SCAD” or “TS-SIS-LASSO” cannot be 
greater than that of TS-SIS, since a SNP or an epistatic interaction is con¬ 
sidered by the variable selection procedure only if it is correctly identified by 
TS-SIS. In each column under “False Positive Rate (xl0“'^),” we report the 
false positive rate defined as the proportion of unimportant SNPs that are 
incorrectly identified. We also report the median computing time for TS-SIS 
with penalized regression over all replications. The simulation is conducted 
on a 32-bit windows 7 system, with an Intel (R) 15-2400 processor, 3.10 GHz, 
4G memory. 

According to Table 2, the TS-SIS captures most of the SNPs with active 
main effects, as well as SNPs without main effects but demonstrating active 
interactions. As a result, important SNPs are selected in the reduced model, 
and the majority of irrelevant SNPs are eliminated before variable selection. 
This critical step greatly improves the probability of effectively identifying 
important SNPs and interactions in GWAS analysis. After TS-SIS, SNPs and 
interactions in the reduced model are selected by either SGAD or LASSO. 
As expected, variable selection further reduces the false positive rate and 
increases the interpretability of the final model. Gompared with LASSO, 
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Table 2 

Statistical power, false positive rate and running time of the proposed TS-SIS approach 




Power {%) 


False positive rate (xlO ^) 

Time 

(seconds) 

TS-SIS 

TS-SIS- 

SCAD 

TS-SIS- 

LASSO 

TS-SIS 

TS-SIS- 

SCAD 

TS-SIS- 

LASSO 



Homogeneous case {MAP = 0.50) 



(0.8, 6) 

97.0 

92.7 

86.5 

4.22 

2.18 

2.92 

12.29 


(8.3) 

(11.7) 

(13.3) 

(2.35) 

(1.09) 

(1.47) 


(0.8, 8) 

95.5 

86.2 

82.3 

4.52 

2.42 

3.19 

13.64 


(9.1) 

(13.4) 

(15.9) 

(2.51) 

(1.23) 

(1.68) 


(0.5, 6) 

97.8 

94.5 

88.2 

2.93 

1.74 

2.29 

10.05 


(6.6) 

(9.5) 

(13.2) 

(1.58) 

(0.88) 

(1.16) 


(0.5, 8) 

95.8 

90.7 

88.5 

2.76 

1.72 

2.21 

11.13 


(8.3) 

(13.0) 

(11.6) 

(1.69) 

(1.01) 

(1.33) 


(0.2, 6) 

95.9 

91.8 

87.7 

2.89 

1.75 

2.27 

7.88 


(9.0) 

(11.7) 

(13.4) 

(1.88) 

(1.05) 

(1.43) 


(0.2, 8) 

95.5 

87.8 

87.1 

2.70 

1.81 

2.28 

9.12 


(9.7) 

(13.6) 

(13.1) 

(1.71) 

(1.08) 

(1.42) 




Heterogeneous case {mixed MAFs) 



(0.8, 6) 

98.17 

89.83 

89.50 

4.71 

2.30 

3.18 

12.41 


(5.24) 

(11.58) 

(11.28) 

(3.51) 

(1.30) 

(1.99) 


(0.8, 8) 

96.50 

85.50 

88.33 

5.02 

2.46 

3.42 

13.15 


(7.22) 

(12.46) 

(11.73) 

(4.17) 

(1.49) 

(2.11) 


(0.5, 6) 

98.17 

91.50 

90.67 

3.47 

1.87 

2.45 

11.17 


(5.24) 

(11.48) 

(11.68) 

(2.93) 

(1.26) 

(1.78) 


(0.5, 8) 

95.00 

87.67 

88.33 

3.08 

1.82 

2.42 

10.38 


(9.91) 

(13.94) 

(14.11) 

(2.36) 

(1.12) 

(1.75) 


(0.2, 6) 

97.67 

90.17 

90.83 

3.08 

1.80 

2.37 

10.59 


(6.28) 

(12.10) 

(11.93) 

(2.15) 

(1.10) 

(1.54) 


(0.2, 8) 

94.33 

87.50 

89.33 

2.58 

1.53 

2.05 

9.25 


(9.54) 

(13.06) 

(11.24) 

(2.33) 

(0.99) 

(1.61) 



SCAD can identify truly important SNPs with higher probability for the 
homogeneous case. As more SNPs have lower MAFs in the heterogeneous 
case, two penalized regressions have comparable statistical powers. In addi¬ 
tion, SCAD delivers smaller false positive rates consistently in all simulation 
scenarios. Table 2 further suggests that, as cr^ decreases, the statistical pow¬ 
ers increase, but the linkage disequilibrium of two SNPs measured by p plays 
a limited role in this setting. Besides, this variable screening procedure is 
very fast even though millions of potential pairwise interactions are present 
in each simulation. 

Table 3 gives the statistical power of detecting interactions, or the average 
proportion of interactions that are selected over 100 simulations. By compar- 
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Table 3 

Statistical power of detecting interactions of the proposed TS-SIS approach 



Homogeneous case 

Heterogeneous case 

TS-SIS 

TS-SIS- 

SCAD 

TS-SIS- 

LASSO 

TS-SIS 

TS-SIS- 

SCAD 

TS-SIS- 

LASSO 

(0.8, 6) 

93.3 

93.3 

92.7 

97.0 

97.0 

97.0 


(13.4) 

(13.4) 

(14.7) 

(9.6) 

(9.6) 

(9.6) 

(0.8, 8) 

96.7 

96.3 

96.3 

94.7 

94.3 

94.7 


(10.1) 

(10.5) 

(10.5) 

(13.2) 

(14.3) 

(13.2) 

(0.5, 6) 

96.7 

96.7 

96.7 

96.7 

96.7 

96.7 


(10.1) 

(10.1) 

(10.1) 

(10.1) 

(10.1) 

(10.1) 

(0.5, 8) 

92.6 

92.6 

92.6 

91.0 

91.0 

91.0 


(15.1) 

(15.1) 

(15.1) 

(18.3) 

(18.3) 

(18.3) 

(0.2, 6) 

93.5 

93.5 

93.5 

96.0 

96.0 

96.0 


(13.3) 

(13.3) 

(13.3) 

(10.9) 

(10.9) 

(10.9) 

(0.2, 8) 

92.7 

92.7 

92.7 

91.0 

91.0 

91.0 


(16.1) 

(16.1) 

(16.1) 

(17.0) 

(17.0) 

(17.0) 


ing Table 3 with Table 2, it can be seen that interactions are relatively more 
difficult to capture by variable screenings than main effects. This is under¬ 
standable since the number of interaction terms is huge compared with the 
number of main effect terms. Once important interactions are identified by 
TS-SIS, however, they are unlikely to be missed by the following penalized 
regression. As a result, the statistical power of the entire procedure is very 
close to that of TS-SIS. In Table 4 we report the results when the number 
of SNPs is doubled (p = 6996) for MAF = 0.50, with all other specifications 
unchanged. Interestingly, although the statistical power of TS-SIS increases, 
the power of SCAD and LASSO regressions slightly decreases, because the 
same a and /3 in Theorem 1 imply a larger reduced model from TS-SIS.^ 
However, given that the number of interactions increases from about 24.5 
million to about 98 million, the performance of TS-SIS is excellent, as can be 
seen from the increased statistical power and decreased false positive rates. 
In Table 4 we do not change a and /3 for comparison purposes; we consider 
in future research the effects of user-specified rates. 

We also compare this framework with other methods for detecting SNP- 
SNP interactions in simulation studies with MAF = 0.5. Although most of 
the available interaction detection methods are designed for binary phe¬ 
notypes, the Mendel software program [Lange et al. (2001, 2013)] and the 
Screen and Clean (SC) method [Wu et al. (2010)] can identify important 


®On average, the total number of main effects and interactions selected by TS-SIS 
increases from 46.9 to 66.2. 








DETECT GENE-GENE INTERACTIONS IN GWAS 


17 


Table 4 

Statistical power, false positive rate and running time of the proposed TS-SIS approach 
when the number of SNPs is doubled (p = 6996J for MAP = 0.5 




Power (%) 


False positive rate (xlO ^) 

Time 

(seconds) 

TS-SIS 

TS-SIS- 

SCAD 

TS-SIS- 

LASSO 

TS-SIS 

TS-SIS- 

SCAD 

TS-SIS- 

LASSO 

(0.8, 6) 

99.5 

90.0 

83.7 

1.39 

0.74 

0.97 

39.37 


(3.7) 

(12.9) 

(14.2) 

(0.64) 

(0.29) 

(0.40) 


(0.8, 8) 

97.3 

87.8 

84.1 

1.32 

0.78 

1.01 

44.68 


(7.8) 

(14.3) 

(12.7) 

(0.60) 

(0.29) 

(0.39) 


(0.5, 6) 

97.3 

89.8 

85.0 

1.15 

0.68 

0.89 

29.90 


(7.0) 

(11.8) 

(14.3) 

(0.58) 

(0.32) 

(0.41) 


(0.5, 8) 

97.1 

86.3 

83.7 

1.17 

0.74 

0.95 

36.56 


(8.2) 

(14.8) 

(14.9) 

(0.53) 

(0.33) 

(0.42) 


(0.2, 6) 

98.8 

93.8 

87.2 

1.06 

0.67 

0.86 

25.97 


(4.3) 

(10.2) 

(12.5) 

(0.47) 

(0.28) 

(0.36) 


(0.2, 8) 

94.7 

80.7 

84.0 

1.18 

0.77 

0.99 

33.52 


(9.7) 

(15.1) 

(13.8) 

(0.52) 

(0.29) 

(0.38) 



SNPs as well as interactions in GWAS analysis for the quantitative pheno¬ 
type. Moreover, they are scalable and computationally efficient. Specifically, 
Analysis Option 24 in Mendel software is very convenient to test for main 
genetic effects and interaction effects based on marginal p-values or LASSO 
type analysis [Wu and Lange (2008); Wu et al. (2009); Zhou et ah (2010)]. 
Table 5 reports the results from four major analysis options of Mendel: (1) 
marginal analysis for main effects followed by testing important marginal 
effects against all SNPs for interactions (Mendel 1), (2) marginal analysis 
for main effects followed by testing all pairwise interactions among top SNPs 
(Mendel 2), (3) LASSO analysis for main effects followed by testing impor¬ 
tant marginal effects against all SNPs for interactions (Mendel 3), and (4) 
LASSO analysis for main effects followed by testing all pairwise interactions 
among top SNPs (Mendel 4). Since these four analysis options generate final 
models with pre-determined sizes, we use the default model size of 10 for 
main effects and then determine the number of selected interactions in a way 
that the final model size is the same as our method (TS-SIS-SCAD). Table 5 
also reports the performance of the Screen and Clean (SC) method (column 
“SC”) and hard-thresholding-based TS-SIS (column “Hard-SCAD” and col¬ 
umn “Hard-LASSO”), where the first [n/logn] SNPs are selected in TS-SIS. 
Since the final model size of Mendel is user specihed, the false positive rate 
is not reported. 

Among all alternative approaches, Mendel 3 has the best performance fol¬ 
lowed by Mendel 1 and hard-thresholding-based approaches. Both Mendels 
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Table 5 

Statistical power and false positive rate of alternative methods 






Power (%) 



FPR (xlO“ 


Hard- 

SCAD 

Hard- 

LASSO 

SC 

Mendel 

1 

Mendel 

2 

Mendel 

3 

Mendel 

4 

Hard- 

SCAD 

Hard- 

LASSO 

SC 

(0.8, 6) 

89.3 

81.0 

69.9 

85.3 

49.0 

89.3 

49.3 

4.0 

4.4 

4.8 


(9.3) 

(8.5) 

(10.5) 

(8.7) 

(4.0) 

(10) 

(3.3) 

(0.4) 

(0.4) 

(2.7) 

(0.8, 8) 

81.5 

80.5 

61.9 

85.7 

49.7 

88.0 

49.0 

4.2 

4.6 

5.3 


(11.3) 

(11.9) 

(12.0) 

(9.5) 

(2.4) 

(10.1) 

(4.0) 

(0.3) 

(0.4) 

(3.2) 

(0.5, 6) 

87.5 

81.7 

68.8 

83.3 

50.0 

86.7 

49.7 

4.5 

4.8 

5.1 


(10.7) 

(9.6) 

(13.5) 

(6.7) 

(0) 

(9.5) 

(2.4) 

(0.3) 

(0.3) 

(2.8) 

(0.5, 8) 

81 

81.3 

60.5 

83.0 

49.0 

86.0 

49.7 

4.6 

4.9 

5.3 


(11.1) 

(12.8) 

(11.6) 

(7.9) 

(4.0) 

(10.3) 

(2.4) 

(0.3) 

(0.4) 

(3.5) 

(0.2, 6) 

86.2 

81.8 

66.8 

86.0 

49.3 

89.3 

49.3 

4.5 

4.9 

4.8 


(10.1) 

(9.2) 

(12.1) 

(10.3) 

(3.3) 

(10.5) 

(3.3) 

(0.3) 

(0.4) 

(2.2) 

(0.2, 8) 

79.8 

80.2 

63.1 

83.3 

49.3 

85.0 

49.7 

4.6 

5.0 

4.2 


(12.6) 

(14.9) 

(9.2) 

(8.9) 

(3.3) 

(9.7) 

(2.4) 

(0.3) 

(0.3) 

(2.3) 


3 and 1 test the interactions between marginally important SNPs and all 
SNPs, but Mendel 3 selects marginally important SNPs by LASSO regres¬ 
sions and Mendel 1 is based on the conventional marginal analysis. Mendels 
2 and 4 cannot give statistical power greater than 50% since only interac¬ 
tions among top SNPs are considered. In terms of hard-thresholding-based 
TS-SIS procedures (“Hard-SCAD” and “Hard-LASSO”), their performance 
is less satisfactory since too many variables retained after variable screen¬ 
ing lead to a lower statistical power and an inflated false positive rate. But 
similar to Table 2, SCAD regression tends to be associated with a higher 
statistical power and a smaller false positive rate. Last, the Screen and Clean 
method has a low statistical power and a large and unstable false positive 
rate. 

In summary, TS-SIS guided by the RATE approach is effective and effi¬ 
cient in selecting truly important genetic effects and eliminating false posi¬ 
tives for the following penalized regressions. In the context of the ultrahigh¬ 
dimensional GWAS model where a huge number of potential predictors are 
considered, they are recommended in the real data analysis. 

6. Framingham data analysis. We use the newly developed framework 
to analyze a real GWAS data set from the Framingham Heart Study, a 
cardiovascular study based in Framingham, Massachusetts, supported by the 
National Heart, Lung, and Blood Institute and Boston University [Dawber, 
Meadors and Moore (1951)]. Recently, 550,000 SNPs have been genotyped 
for the entire Framingham cohort [Jaquish (2007)], from which 977 unrelated 
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subjects including 418 males and 559 females were randomly chosen for our 
data analysis, conforming to the assumption of population-based GWAS. 
For each subject, body mass index (BMI) is measured at multiple time 
points between age 29 and age 61. We take the first measurement for each 
individual, although the age of receiving the first measurement varies across 
individuals. 

As a common practice in GWAS analysis, SNPs with rare allele frequency 
<10% were excluded from data analysis, which leaves 349,985 SNPs across 23 
chromosomes of the whole genome. 5.16% of the remaining SNPs, however, 
contain missing genotypes for some subjects. Since we are interested in de¬ 
tecting active genetic effects rather than handling missing data in this study, 
for each missing genotype of each subject, we randomly draw a genotype ac¬ 
cording to the SNP’s genotypic frequencies across all subjects whose geno¬ 
types are known. Then, by including gender and age as two covariates, we 
follow the procedure described in previous sections to select SNPs with ac¬ 
tive main effects and construct an epistatic network explaining the observed 
BMI variations. In the RATE assisted TS-SIS procedure, in particular, the 
confidence level is the same as that in simulation studies (/3 = 0.0001), but 
a is set to 0.0005 in screening for main effects and to 0.00001 in detecting 
interactions. 

Out of 349,985 SNPs and numerous two-way interaction terms, 23 active 
main effects and 24 active epistatic interactions are detected by the TS-SIS 
procedure followed by SGAD penalized regression. Then, we refit a linear 
regression model with these selected SNPs and two covariates being predic¬ 
tors, and obtain the estimated regression coefficient and heritability for each 
selected SNP. Tables 6 and 7 tabulate the information of selected SNPs with 
nonzero main and epistatic interaction effects, respectively, including chro¬ 
mosomes, names, minor allele frequencies (MAE), estimated genetic effects 
and heritabilities. Specifically, heritability is the proportion of the pheno¬ 
typic variance explained by the genetic variance of a particular effect. Eor 
an additive or dominant effect, it is calculated as 

,2 _ ‘^PAPajgj + ip A - Pa)djf‘ + {2pAPadjf‘ 
var(y) 

where pA is the allele frequency for A and pa is the allele frequency for a. 
Eor the epistatic interactions, the heritability calculation under our general 
genetic model is more involved. Suppose SNP j has alleles A and a, and SNP 
j' has alleles B and b. Then for genotypes AABB, AABb, AAbb, AaBB, 
AaBb, Aabb, aaBB, aaBb and aabb, the vector of genotype frequencies is 

= {PAPB,‘^PAPBPb,PAPh‘^PAPaPB,^PAPaPBPb, 

o 2 2 2 r, 2 2 2\T 

^PAPaPb ,PaPB^ ^PaPBPb , PaPh ) > 
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Table 6 

Information of SNPs with active main effects in the Framingham Heart Study 



Additive effects 


Dominant effects 




Heritability 


Heritability 

Chr. 

Name MAF Effect 

(%) Chr. 

Name MAF Effect 

(%) 


1 

SS66041272 

0.49 

-0.82 

1.92 

3 

SS66173500 

0.29 

-0.26 

0.35 

1 

SS66276746 

0.13 

-0.42 

0.23 

3 

SS66142093 

0.30 

-0.63 

2.06 

4 

SS66346559 

0.28 

-0.51 

0.60 

4 

SS66354801 

0.27 

0.30 

0.45 

4 

ss66159949 

0.29 

0.12 

0.03 

6 

SS66166806 

0.34 

-0.45 

1.09 

5 

SS66316662 

0.38 

0.37 

0.37 

6 

SS66299053 

0.34 

0.41 

0.91 

5 

SS66118377 

0.50 

0.06 

0.01 

6 

SS66090554 

0.27 

0.24 

0.29 

7 

SS66083530 

0.19 

-0.47 

0.39 

7 

SS66083530 

0.35 

-0.28 

0.43 

8 

SS66177628 

0.23 

-0.01 

0.00 

7 

SS66249128 

0.33 

-0.64 

2.19 

9 

SS66095597 

0.28 

-0.60 

0.83 

7 

SS66314446 

0.21 

0.62 

1.70 

12 

SS66086159 

0.36 

-0.45 

0.53 

8 

SS66381612 

0.27 

-0.32 

0.51 

21 

SS66511535 

0.16 

-0.44 

0.30 

11 

SS66369823 

0.25 

0.21 

0.21 






13 

SS66487154 

0.30 

0.38 

0.75 


and the associated genetic values are 

g = (dj + CLj/ + dj + dji + ~^j 

dj + dji + , dj + dji + j~dj + dj' — Xjj/ 

dj + dji — Ijji,dj — dji — 1.jji, —dj — dji + ) . 

Therefore, the genetic variance is — (cj^g)^, and the epistatic variance 
is this genetic variance minus the genetic variances of two main effects. 
Finally, the associated epistatic heritability is the epistatic variance divided 
by the phenotypic variance. If dominant effects are not modeled, this formula 
gives exactly the same result as the one proposed in Wu and Zhao (2009), 
where two SNPs’ additive effects and their additive x additive interaction 
are considered. 

Generally speaking, main genetic effects contribute to 16.1% of the phe¬ 
notypic BMI variation, among which 5.2% is due to the additive genetic 
effects and 10.9% is due to the dominant genetic effects. Epistasis, on the 
other hand, explains 23.0% of the phenotypic variation. It is worth noting 
that a few SNPs and interactions demonstrate stronger genetic effects than 
others. In other words, although the expression of the BMI trait is deter¬ 
mined by many SNPs, there exist some SNPs that may be more influential. 
For example, out of the 23 SNPs exhibiting significant additive or dominant 
effects, five have heritabilities greater than 1%. This number increases to six 
for epistatic interactions. 
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Table 7 

Information of SNPs with significant interactions in the Framingham Heart Study 



Root 1 



Root 2 



Heritability 

(%) 

Chr. 

Name 

MAF 

Chr. 

Name 

MAF 

Effect 



Additive x additive interactions 



1 

SS66041272 

0.49 

6 

SS66061582 

0.21 

-0.95 

2.08 

9 

SS66095597 

0.28 

4 

SS66151090 

0.11 

-0.80 

0.89 



Additive x 

dominant interactions 



1 

SS66137441 

0.49 

17 

SS66248774 

0.49 

1.06 

1.70 

3 

SS66081331 

0.28 

3 

SS66142093 

0.35 

0.56 

0.30 

3 

SS66375852 

0.38 

23 

SS66107600 

0.33 

-1.13 

0.83 

4 

SS66159949 

0.29 

11 

SS66132273 

0.38 

0.74 

0.71 

8 

SS66177628 

0.23 

13 

SS66487154 

0.34 

-1.06 

1.32 



Dominant 

X additive interactions 



3 

SS66142093 

0.3 

2 

SS66430035 

0.25 

-1.00 

0.97 

3 

SS66142093 

0.3 

3 

SS66081331 

0.28 

-0.74 

0.61 

3 

SS66142093 

0.3 

3 

SS66483001 

0.30 

-0.33 

0.14 

4 

SS66354801 

0.27 

7 

SS66416257 

0.21 

-0.72 

0.53 

6 

SS66316737 

0.29 

21 

SS66113670 

0.10 

1.42 

2.75 

7 

SS66468842 

0.33 

7 

SS66083530 

0.19 

-0.14 

0.03 

7 

SS66249128 

0.33 

8 

SS66047672 

0.23 

-0.88 

0.79 

15 

SS66058021 

0.38 

1 

SS66325411 

0.17 

1.01 

0.90 



Dominant x dominant interactions 



3 

SS66142093 

0.3 

4 

SS66444506 

0.26 

1.09 

0.89 

3 

SS66142093 

0.3 

8 

SS66468875 

0.39 

0.97 

0.71 

3 

SS66142093 

0.3 

11 

SS66152909 

0.35 

1.08 

0.78 

7 

SS66468842 

0.33 

11 

SS66318229 

0.29 

1.02 

0.68 

7 

SS66249128 

0.33 

12 

SS66451087 

0.14 

2.25 

2.29 

7 

SS66249128 

0.33 

12 

SS66109005 

0.16 

-1.08 

0.61 

11 

SS66369823 

0.25 

10 

SS66482189 

0.42 

1.23 

1.74 

11 

SS66369823 

0.25 

3 

SS66142093 

0.35 

-0.40 

0.15 

18 

SS66306728 

0.3 

16 

SS66394113 

0.13 

1.19 

0.55 


To depict an overall picture of genetic control for BMI by SNP-SNP epis- 
tasis, we draw a web of additive x additive, additive x dominant, dominant x 
additive and dominant x dominant interactions in Figure 1 which shows the 
genomic distribution of SNPs that interact with each other. From this hg- 
ure, we obtain the following interesting results: (1) epistasis appears to be 
distributed randomly throughout the genome, although a few SNPs, such 
as ss66142093 on chromosome 3 and ss66249128 and ss66468842 on chro¬ 
mosome 7 tend to interact with many other SNPs. (2) Active epistasis may 
not be due to interactions between two SNPs, both of which display active 
marginal effects. Of the 24 selected pairs, there are two cases in which both 
SNPs have active marginal effects and there are 14 cases in which only one 
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Fig. 1. A picture of significant SNP-SNP interactions for BMI in the Framingham 
Heart Study. The numbers beside SNPs are chromosome numbers. The SNPs that dis¬ 
play significant additive (A) and dominant (D) effects are indicated by arrows. The pairs 
of SNPs with significant additive x additive, additive x dominant, dominant x additive and 
dominant x dominant interactions are indicated as AA, AD, DA and DD, respectively. 


SNP has an active marginal effect, whereas the counterpart has none. There 
are as many as 8 pairs in which no SNP is active for its marginal effect. 
Notably, the dominant x dominant interaction between SNP ss66249128 on 
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chromosome 7 and SNP ss66451087 on chromosome 12 can explain 2.29% 
of the BMI variation, although the latter is marginally uncorrelated with 
BMI. In the presence of SNP ss66451087, the dominant genetic effect of 
SNP ss66249128 is dramatically impacted (Table 7). 

Since our model allows a large number of SNPs to be analyzed simultane¬ 
ously, the resulting discoveries should be more biologically relevant and sta¬ 
tistically robust than those from traditional single-SNP approaches. For ex¬ 
ample, SNP ss66142093 on chromosome 3 was detected to explain 2.97% her- 
itability. This SNP is near a candidate gene ANAPC13 involved in pathways 
for bone and cartilage development that affects human height and stature 
through cell cycle regulation and mitosis [Weedon and Frayling (2008)]. 

In other GWAS for BMI [Frayling et al. (2007); Scuteri et al. (2007); Spe- 
liotes et al. (2010)], significant SNPs were repeatedly detected on chromo¬ 
somes I, 3, 4, 6, 7 and II in NEGRI, ETV5, GNPDA2. BDNF and MTGH2 
loci. Our results of main genetic effects are in agreement with previous re¬ 
ports about the presence of common variants near these loci associated with 
biochemical pathways toward obesity. The result on epistatic interactions 
suggests large epistatic effects among chromosomes 3, 7 and II which have 
not been reported in previous studies, possibly showing the unique power 
of this new approach. A recent review on identifying genes responsible for 
type 2 diabetes confirms genomic regions harboring disease susceptibility 
loci [Erayling et al. (2007)]. These regions include two on chromosome 3 and 
one on each of chromosomes 4, 6, 9 and 12. We have noticed many SNPs 
identified in this study overlap with those detected by previous studies tar¬ 
geting type 2 diabetes, suggesting the underlying correlations between BMI 
and type 2 diabetes. Additionally, our analysis shows that the regression co¬ 
efficients for gender and age are —0.12 and O.OI, respectively. That is, after 
adjusting for these genetic factors, the risk of obesity is higher for females, 
and the risk increases with age. 

To further evaluate the significance and predictability of the proposed 
method, we randomly partition the original real data set into two parts: 
the training data set with 900 subjects and the validation data set with the 
remaining 77 individuals. We apply the proposed RATE assisted TS-SIS 
followed by the SCAD penalized regression to the training data set, and 
then use the validation data set to evaluate the estimated model. Denote 
by Y* the response BMI value of the ith subject in the validation data set, 
and Y* the predicted BMI value by the estimated model using the training 
data set, where i = 1,2,..., 77. We compute the following two criteria to 
evaluate the prediction performance. Eirst, we calculate the relative mean 
absolute prediction error (RMAPE), which is the difference between Y* and 
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Y* divided by the true value of Y*: 

77 , ^ , 

RMAPE = — V ^ 

77 Y* 

Second, we note that a primary interest of predicting BMI is to predict 
whether the individual is obese or not, that is, BMI> 30. Thus, we com¬ 
pute the classihcation accuracy (CA) of the validation data set using the 
estimated model: 

1 

CA=1 - ^ > 30) - ny: >30)1, 

2=1 

where /(•) is an indicator function. Then, we repeat the above validation 
experiment 10 times. The average RMAPE is 14.10%, and the standard 
deviation of RMAPE is 0.85%. The average CA is 82.77%, with a standard 
deviation of 3.37%. These results suggest that our model predicts well in the 
out-of-sample validation data sets. 

7. Discussion. Identifying genetic interaction network is an important 
task in genome-wide association studies, but is challenged by the sheer vol¬ 
ume of genetic data. In this paper we present a comprehensive GWAS model 
and propose a statistical framework to identify important SNPs and interac¬ 
tions which jointly explain the observed phenotypes. Specifically, a two-stage 
sure independence screening procedure (TS-SIS) is proposed to formulate a 
candidate pool of SNPs, including those without weak main effects, but 
serving as a root in two-way interactions. This procedure expands the liter¬ 
ature by relaxing the restrictive assumption that two roots in an interaction 
have to be marginally correlated with the response. A RATE approach is 
also proposed to determine the number of predictors retained in each stage 
of TS-SIS. This approach can also be applied to other variable screening 
problems. 

Wu and Zhao (2009) derived an analytical approach to calculate the power 
of a model selection strategy in GWAS that is similar to the proposed TS- 
SIS. Their approach allows for random genotypes, correlation among test 
statistics as well as a false-positive control. It is straightforward to apply 
their power calculations to our framework. Since the TS-SIS procedure pro¬ 
vided a relatively low-dimensional regression model containing important 
SNPs with high probability, existing penalized least squares estimations and 
their empirical performances in GWAS analysis provided valuable guidance 
for selecting important SNPs and constructing a gene-gene interaction net¬ 
work. 
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The new model has been nsed to analyze GWAS data from the Framing¬ 
ham Heart Study [Dawber, Meadors and Moore (1951)], aimed to identify ge¬ 
netic variants that affect cardiovascular diseases and their related traits such 
as blood pressure and BMI [Jaquish (2007)]. To the best of our knowledge, 
this is likely the first study that has detected genetic interactions for obesity- 
related traits in GWAS. Since the detected SNPs displaying important inter¬ 
actions may be harbored in genes of the BMI-associated metabolic pathways 
[Speliotes et al. (2010)], plus higher heritabilities collectively explained by 
them, our model should provide a powerful and useful tool for understanding 
the underlying genetic mechanisms and regulatory network of obesity. For 
example, dopamine, which is a neurotransmitter, modulates motivation and 
rewarding properties of eating. Wang et al. (2001) confirmed by biomedical 
experiments that brain dopamine levels are significantly lower in the obese 
individuals, suggesting strong correlations between BMI and genetic regu¬ 
latory networks. The use of our model to detect dopamine-associated SNPs 
in a GWAS study should help to unravel the genetic architecture of obesity. 

Our statistical procedure is capable of identifying epistatic interactions 
and enables researchers to decipher a detailed picture of the genetic archi¬ 
tecture of human diseases or complex traits. So far, we have concentrated 
on detecting interactions for a continuous trait in GWAS. The proposed 
TS-SIS assisted SGAD regression can be readily extended to case-control 
cohorts, family trios or survival data analysis in genome-wide association 
studies. The framework can also be applied to other statistical problems, 
where the accurate detection of interactions is desired in the presence of 
high-dimensional data sets or ultrahigh-dimensional data sets. 

APPENDIX 


Proof of Theorem 1. Let any r G A/+, the set of positive integers. 
The event {|AI D > r} represents that at least r unimportant variables 
rank on the top of all auxiliary variables. Because the inactive variables 
{Xj :j G A4'^} and auxiliary variables {Zj. :k = 1,... ,d} are exchangeable, 
we follow the idea of Zhu et al. (2011) and have that 


(AT) 


P{\Mr\M^\>r) < 
< 


< 


{Pn 'Sn)' / {Pn •Sn T d)! 

{Pn - Sn - r)\r\ / {pn - Sn + d - r)\r\ 
{Pn-Sn + d-r) X ■■■ X {pn- Sn + l- r) 

ijPn 'Sn T d) X • • • X (jpn Sn T 1) 



Pn + d 
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= Pn — Sn > Pn — n hy the sparsity principle. If we can assume \M.\ <n, 
it follows that 


Pi < a ) =l-P(|MnM^|>a|M" 


\M'^ 


(A.2) 


> 1 — P{\}A n A4^\ > a{pn — n)) 

d 

, rv\n^ — 77 ,) I 

>i-U- 


a{Pn - n) 

Pn + d 


where the second inequality follows by (A.l). □ 
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